NSF PAR Search | NSF Public Access Repository

Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level

Hassani, Ali; Hwu, Wen-mei; Shi, Humphrey (December 2024, NeurIPS 2024)

Neighborhood attention reduces the cost of self attention by restricting each token’s attention span to its nearest neighbors. This restriction, parameterized by a window size and dilation factor, draws a spectrum of possible attention patterns between linear projection and self attention. Neighborhood attention, and more generally sliding window attention patterns, have long been bounded by infrastructure, particularly in higher-rank spaces (2-D and 3-D), calling for the development of custom kernels, which have been limited in functionality, performance, or both. In this work, we aim to massively improve upon existing infrastructure by providing two new methods for implementing neighborhood attention. We first show that neighborhood attention can be represented as a batched GEMM problem, similar to standard attention, and implement it for 1-D and 2-D neighborhood attention. These kernels on average provide 895% and 272% improvement in full precision runtime compared to existing naive CUDA kernels for 1-D and 2-D neighborhood attention respectively. We find that aside from being heavily bound by memory bandwidth, certain inherent inefficiencies exist in all unfused implementations of neighborhood attention, which in most cases undo their theoretical efficiency gain. Motivated by the progress made in fused dot-product attention kernels, we developed fused neighborhood attention, an adaptation of fused dot-product attention kernels that allows fine-grained control over attention across different spatial axes. Known for reducing the quadratic time complexity of self attention to a linear complexity, neighborhood attention can now enjoy a reduced and constant memory footprint, and record-breaking half precision runtime. We observe that our fused implementation successfully circumvents some of the unavoidable inefficiencies in unfused implementations. While our unfused GEMM-based kernels only improve half precision performance compared to naive kernels by an average of 548% and 193% in 1-D and 2-D problems respectively, our fused kernels improve naive kernels by an average of 1759% and 958% in 1-D and 2-D problems respectively. These improvements translate into up to 104% improvement in inference and 39% improvement in training existing models based on neighborhood attention, and additionally extend its applicability to image and video perception, as well as other modalities.
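The attention pattern described in this abstract can be illustrated with a minimal pure-Python reference for the 1-D case. This is only a sketch, not the authors' GEMM-based or fused kernels. The function names are hypothetical, and two details are assumptions rather than statements from the abstract: windows are shifted (not truncated) at sequence edges so that every token has exactly `window` neighbors, and dilation restricts a token to neighbors sharing its index modulo `dilation`.

```python
import math

def neighborhood_indices(i, n, window, dilation=1):
    """Indices of the `window` neighbors of token i in a length-n sequence.

    Assumption: neighbors come from the tokens that share i's residue modulo
    `dilation`, and the window is shifted at the edges instead of truncated.
    """
    group = list(range(i % dilation, n, dilation))  # tokens reachable at this dilation
    if window > len(group):
        raise ValueError("window is too large for this sequence length and dilation")
    pos = i // dilation                              # position of i inside its group
    start = pos - window // 2                        # centered window
    start = max(0, min(start, len(group) - window))  # shifted to stay in bounds
    return group[start:start + window]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def neighborhood_attention_1d(q, k, v, window, dilation=1):
    """Naive reference: each query attends only to its neighborhood."""
    n, dim = len(q), len(q[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for i in range(n):
        nbrs = neighborhood_indices(i, n, window, dilation)
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j])) for j in nbrs]
        weights = softmax(scores)
        out.append([sum(w * v[j][c] for w, j in zip(weights, nbrs))
                    for c in range(len(v[0]))])
    return out

if __name__ == "__main__":
    print(neighborhood_indices(0, 8, 3))              # [0, 1, 2]
    print(neighborhood_indices(3, 8, 3))              # [2, 3, 4]
    print(neighborhood_indices(7, 8, 3))              # [5, 6, 7]
    print(neighborhood_indices(4, 8, 3, dilation=2))  # [2, 4, 6]
```

Under these assumptions the two ends of the spectrum mentioned in the abstract fall out directly: with `window=1` each token attends only to itself, so the output equals `v`, and with `window` equal to the sequence length (and `dilation=1`) the result matches ordinary self attention. Each token does work proportional to `window` rather than to the sequence length, which is where the linear complexity comes from.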
Full Text Available
SSDTrain: An Activation Offloading Framework to SSDs for Faster Large Language Model Training

https://doi.org/10.1109/DAC63849.2025.11132754

Wu, Kun; Park, Jeongmin Brian; Zhang, Xiaofan; Hidayetoğlu, Mert; Mailthody, Vikram Sharma; Huang, Sitao; Lumetta, Steve; Hwu, Wen-Mei (June 2025, IEEE)

GPU memory capacity has not grown as fast as the size of large language models (LLMs), which hinders model training. In particular, activations (the intermediate tensors produced during forward propagation and reused in backward propagation) dominate GPU memory use. This leads to high training overheads such as expensive weight update costs due to the small micro-batch size. To address this challenge, we propose SSDTrain, an adaptive activation offloading framework to high-capacity NVMe SSDs. SSDTrain reduces GPU memory usage without impacting performance by fully overlapping data transfers with computation. SSDTrain is compatible with popular deep learning frameworks like PyTorch, Megatron, and DeepSpeed, and it employs techniques such as tensor deduplication and forwarding to further enhance efficiency. We extensively experimented with popular LLMs like GPT, BERT, and T5. Results demonstrate that SSDTrain reduces activation peak memory usage by 47%. At the same time, SSDTrain perfectly overlaps the I/O with the computation and incurs negligible overhead. Compared with keeping activations in GPU memory and layerwise full recomputation, SSDTrain achieves the best memory savings with negligible throughput loss. We further analyze how the reduced activation memory use may be leveraged to increase throughput by increasing micro-batch size and reducing pipeline parallelism bubbles.
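The idea of overlapping activation transfers with computation can be sketched with a toy example. This is not SSDTrain's implementation: it uses plain Python lists in place of GPU tensors, a temporary directory in place of an NVMe SSD, and a hypothetical `ActivationOffloader` class whose names and structure are assumptions made purely for illustration.

```python
import os
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor

class ActivationOffloader:
    """Hypothetical helper that moves activations to disk on background threads."""

    def __init__(self, directory):
        self.directory = directory
        self.pool = ThreadPoolExecutor(max_workers=2)
        self.pending = {}  # layer index -> Future for the in-flight write or read

    def _path(self, layer):
        return os.path.join(self.directory, f"act_{layer}.pkl")

    def _write(self, layer, activation):
        with open(self._path(layer), "wb") as f:
            pickle.dump(activation, f)

    def _read(self, layer):
        with open(self._path(layer), "rb") as f:
            return pickle.load(f)

    def offload(self, layer, activation):
        # Start the write and return at once so the caller can keep computing.
        self.pending[layer] = self.pool.submit(self._write, layer, activation)

    def prefetch(self, layer):
        self.pending[layer].result()  # the write must have finished first
        self.pending[layer] = self.pool.submit(self._read, layer)

    def fetch(self, layer):
        return self.pending.pop(layer).result()

def run(num_layers=4):
    with tempfile.TemporaryDirectory() as directory:
        off = ActivationOffloader(directory)
        x = [1.0, 2.0, 3.0]
        for layer in range(num_layers):       # forward pass
            off.offload(layer, x)             # save the layer input in the background
            x = [2.0 * e for e in x]          # toy layer compute overlaps the write
        grad = [1.0] * len(x)
        restored = []
        off.prefetch(num_layers - 1)
        for layer in reversed(range(num_layers)):  # backward pass
            if layer > 0:
                off.prefetch(layer - 1)       # load the next activation early
            restored.append(off.fetch(layer))
            grad = [2.0 * g for g in grad]    # toy backward compute (d(2x)/dx = 2)
        off.pool.shutdown()
        return x, grad, restored

if __name__ == "__main__":
    output, grad, restored = run()
    print(output, grad, restored[-1])
```

The design point the sketch tries to capture is that `offload` and `prefetch` only enqueue I/O, so the main thread blocks only in `fetch`, and only if the read has not completed by the time the activation is needed.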
Free, publicly-accessible full text available June 22, 2026
Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level

https://doi.org/10.52202/079017-2065

Hassani, Ali; Hwu, Wen-mei; Shi, Humphrey (January 2024, Neural Information Processing Systems Foundation, Inc. (NeurIPS))

Full Text Available
RackBlox: A Software-Defined Rack-Scale Storage System with Network-Storage Co-Design

https://doi.org/10.1145/3600006.3613170

Reidys, Benjamin; Xue, Yuqi; Li, Daixuan; Sukhwani, Bharat; Hwu, Wen-Mei; Chen, Deming; Asaad, Sameh; Huang, Jian (October 2023, ACM)

Full Text Available
An efficient GPU implementation and scaling for higher-order 3D stencils

https://doi.org/10.1016/j.ins.2021.11.042

Anjum, Omer; Almasri, Mohammad; de Gonzalo, Simon Garcia; Hwu, Wen-mei (March 2022, Information Sciences)

Full Text Available
Open Relation Modeling: Learning to Define Relations between Entities

https://doi.org/10.18653/v1/2022.findings-acl.26

Huang, Jie; Chang, Kevin; Xiong, Jinjun; Hwu, Wen-mei (January 2022, Findings of the Association for Computational Linguistics: ACL 2022)

Full Text Available
Exploring HW/SW Co-Design for Video Analysis on CPU-FPGA Heterogeneous Systems

https://doi.org/10.1109/TCAD.2021.3093398

Zhang, Xiaofan; Ma, Yuan; Xiong, Jinjun; Hwu, Wen-mei; Kindratenko, Volodymyr; Chen, Deming (June 2021, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)
Full Text Available
Measuring Fine-Grained Domain Relevance of Terms: A Hierarchical Core-Fringe Approach

https://doi.org/10.18653/v1/2021.acl-long.282

Huang, Jie; Chang, Kevin; Xiong, Jinjun; Hwu, Wen-mei (January 2021, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing)

Full Text Available
Parallelizing Maximal Clique Enumeration on GPUs

https://doi.org/10.1109/PACT58117.2023.00022

Almasri, Mohammad; Chang, Yen-Hsiang; Hajj, Izzat El; Nagi, Rakesh; Xiong, Jinjun; Hwu, Wen-mei (October 2023, IEEE)

Full Text Available
FReaC Cache: Folded-logic Reconfigurable Computing in the Last Level Cache

https://doi.org/10.1109/MICRO50266.2020.00021

Dhar, Ashutosh; Wang, Xiaohao; Franke, Hubertus; Xiong, Jinjun; Huang, Jian; Hwu, Wen-mei; Kim, Nam Sung; Chen, Deming (October 2020, IEEE/ACM International Symposium on Microarchitecture (MICRO))
Full Text Available
